Abstract

The problem is sequence prediction in the following setting. A sequence
$x_1, \dots, x_n, \dots$ of discrete-valued observations is generated according
to some unknown probabilistic law (measure) $\mu$. After observing each
outcome, it is required to give the conditional probabilities of the next
observation. The measure $\mu$ belongs to an arbitrary class $C$ of stochastic
processes. We are interested in predictors $\rho$ whose conditional
probabilities converge to the "true" $\mu$-conditional probabilities if any
$\mu \in C$ is chosen to generate the data. We show that if such a predictor
exists, then a predictor can also be obtained as a convex combination of
countably many elements of $C$. In other words, it can be obtained as a
Bayesian predictor whose prior is concentrated on a countable set. This result
is established for two very different measures of performance of prediction,
one of which is very strong, namely, total variation, and the other is very
weak, namely, prediction in expected average Kullback-Leibler divergence.



1 Introduction 

Given a sequence $x_1, \dots, x_n$ of observations $x_i \in X$, where $X$ is a
finite set, we want to predict the probabilities of observing $x_{n+1} = x$
for each $x \in X$, before $x_{n+1}$ is revealed, after which the process
continues. It is assumed that the sequence is generated by some unknown
stochastic process $\mu$, a probability measure on the set of one-way infinite
sequences $X^\infty$. The goal is to have a predictor whose predicted
probabilities converge (in a certain sense) to the correct ones (that is, to
$\mu$-conditional probabilities). In general this goal is impossible to achieve
if nothing is known about the measure $\mu$ generating the sequence. In other
words, one cannot have a predictor whose error goes to zero for any measure
$\mu$. The problem becomes tractable if we assume that the measure $\mu$
generating the data belongs to some known class $C$. The questions addressed in
this work are a part of the following general problem: given an arbitrary set
$C$ of measures, how can we find a predictor that performs well when the data
is generated by any $\mu \in C$, and whether it is possible to find such a
predictor at all. An example of a generic property of a class $C$ that allows
for construction of a predictor is that $C$ is countable. Clearly, this
condition is very strong. An example, important from the applications point of
view, of a class $C$ of measures for which predictors are known, is the class
of all stationary measures. The general question, however, is very far from
being answered.

The contribution of this work to solving this question is that we provide
a specific form in which to look for a solution to the general problem. More
precisely, we show that if a predictor exists, then a predictor can also be
obtained as a weighted sum of countably many elements of $C$. This result can
also be viewed as a justification of the Bayesian approach to sequence
prediction: if there exists a predictor which predicts well every measure in
the class, then there exists a Bayesian predictor (with a rather simple prior)
that has this property too. In this respect it is important to note that the
result obtained about such a Bayesian predictor is pointwise (holds for every
$\mu$ in $C$), and stretches far beyond the set its prior is concentrated on.

The motivation for studying predictors for arbitrary classes C of processes is 
two-fold. First of all, prediction is a basic ingredient for constructing intelligent 
systems. Indeed, in order to be able to find optimal behaviour in an unknown 
environment, an intelligent agent must be able, at the very least, to predict how 
the environment is going to behave (or, to be more precise, how relevant parts 
of the environment are going to behave). Since the response of the environment
may in general depend on the actions of the agent, this response is necessarily 
non-stationary for explorative agents. Therefore, one cannot readily use pre- 
diction methods developed for stationary environments, but rather has to find 
predictors for the classes of processes that can appear as a possible response of 
the environment. 

Apart from this, the problem of prediction itself has numerous applications 
in such diverse fields as data compression, market analysis, bioinformatics, and
many others. It seems clear that prediction methods constructed for one appli- 
cation cannot be expected to be optimal when applied to another. Therefore, 
an important question is how to develop specific prediction algorithms for each 
of the domains. In order to do this, the first step is to understand for which 
classes of problems (i.e. sets of measures generating the data) a predictor exists. 

Prior work. As was mentioned, if the class $C$ of measures is countable
(that is, if $C$ can be represented as $C := \{\mu_k : k \in \mathbb{N}\}$),
then there exists a predictor which performs well for any $\mu \in C$. Such a
predictor can be obtained as a Bayesian mixture
$\rho_S := \sum_{k \in \mathbb{N}} w_k \mu_k$, where $w_k$ are summable positive
real weights, and it has very strong predictive properties; in particular,
$\rho_S$ predicts every $\mu \in C$ in total variation distance, as follows
from the result of [Blackwell and Dubins, 1962]. Total variation distance
measures the difference in (predicted and true) conditional probabilities of
all future events, that is, not only the probabilities of the next
observations, but also of observations that are arbitrarily far off in the
future (see formal definitions below). In the context of sequence prediction
the measure $\rho_S$ was first studied by [Solomonoff, 1978]. Since then, the
idea of taking a convex combination of a finite or countable class of measures
(or predictors) to obtain a predictor permeates most of the research on
sequential prediction (see, for example, [Cesa-Bianchi and Lugosi, 2006]) and
some related topics in AI [Hutter, 2005; Ryabko and Hutter, 2008a].

In practice it is clear that, on the one hand, countable models are not
sufficient, since already the class $\{\beta_p : p \in [0, 1]\}$ of Bernoulli
i.i.d. processes, where $p$ is the probability of 0, is not countable. On the
other hand, prediction in total variation can be too strong to require;
predicting probabilities of the next observation may be sufficient, maybe even
not on every step but in the Cesaro sense. A key observation here is that a
predictor $\rho_S = \sum w_k \mu_k$ may be a good predictor not only when the
data is generated by one of the processes $\mu_k$, $k \in \mathbb{N}$, but when
it comes from a much larger class. Let us consider this point in more detail.
Fix for simplicity $X = \{0, 1\}$. The Laplace predictor

$$\lambda(x_{n+1} = 0 \mid x_1, \dots, x_n) = \frac{\#\{i \le n : x_i = 0\} + 1}{n + |X|}$$

predicts any Bernoulli i.i.d. process: although convergence in total variation
distance of conditional probabilities does not hold, predicted probabilities
of the next outcome converge to the correct ones. Moreover, generalizing the
Laplace predictor, a predictor $\lambda_k$ can be constructed for the class
$M_k$ of all $k$-order Markov measures, for any given $k$. As was found by
[Ryabko, 1988], the combination $\rho_R := \sum w_k \lambda_k$ is a good
predictor not only for the set $\cup_{k \in \mathbb{N}} M_k$ of all
finite-memory processes, but also for any measure $\mu$ coming from a much
larger class: that of all stationary measures on $X^\infty$. Here prediction is
possible only in the Cesaro sense (more precisely, $\rho_R$ predicts every
stationary process in expected time-average Kullback-Leibler divergence, see
definitions below). The Laplace predictor itself can be obtained as a Bayes
mixture over all Bernoulli i.i.d. measures with uniform prior on the parameter
$p$ (the probability of 0). However, as was observed in [Hutter, 2007] (and as
is easy to see), the same (asymptotic) predictive properties are possessed by a
Bayes mixture with a countably supported prior which is dense in $[0, 1]$
(e.g. taking $\rho := \sum w_k \delta_k$ where $\delta_k$, $k \in \mathbb{N}$
ranges over all Bernoulli i.i.d. measures with rational probability of 0). For
a given $k$, the set of $k$-order Markov processes is parametrized by finitely
many $[0, 1]$-valued parameters. Taking a dense subset of the values of these
parameters, and a mixture of the corresponding measures, results in a predictor
for the class of $k$-order Markov processes. Mixing over these (for all
$k \in \mathbb{N}$) yields, as in [Ryabko, 1988], a predictor for the class of
all stationary processes. Thus, for the mentioned classes of processes, a
predictor can be obtained as a Bayes mixture of countably many measures in the
class.

An additional reason why this kind of analysis is interesting is the
difficulty of constructing Bayesian predictors for classes of processes that
cannot be easily parametrized. Indeed, a natural way to obtain a predictor for
a class $C$ of stochastic processes is to take a Bayesian mixture of the class.
To do this, one needs to define the structure of a probability space on $C$. If
the class $C$ is well parametrized, as is the case with the set of all
Bernoulli i.i.d. processes, then one can integrate with respect to the
parametrization. In general, when the problem lacks a natural parametrization,
although one can define the structure of the probability space on the set of
(all) stochastic processes in many different ways, the results one can obtain
will then be with probability 1 with respect to the prior distribution (see,
for example, [Jackson et al., 1999]), while pointwise consistency cannot be
assured (see e.g. [Diaconis and Freedman, 1986]). Results with prior
probability 1 can be hard to interpret if one is not sure that the structure of
the probability space defined on the set $C$ is indeed a natural one for the
problem at hand (whereas if one does have a natural parametrization, then
usually results for every value of the parameter can be obtained, as in the
case with Bernoulli i.i.d. processes mentioned above). The results of the
present work show that when a predictor exists it can indeed be given as a
Bayesian predictor, which predicts every (and not almost every) measure in the
class, while its support is only countable.
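The comparison between the Laplace predictor and a countably supported Bayes mixture can be illustrated numerically. The following is a minimal Python sketch under stated assumptions, not the construction of any cited work: the function names, the truncation of the rationals to denominators at most `max_den`, and the prior weights proportional to $1/k^2$ are all hypothetical choices made for illustration.

```python
from fractions import Fraction

def laplace_prob_zero(xs):
    """Laplace estimate of P(next symbol = 0 | xs) over the binary alphabet."""
    return Fraction(xs.count(0) + 1, len(xs) + 2)

def rational_params(max_den):
    """Enumerate distinct rationals in (0, 1) with denominator <= max_den.

    This is a finite truncation (an assumption) of a countable dense set.
    """
    seen, out = set(), []
    for den in range(2, max_den + 1):
        for num in range(1, den):
            p = Fraction(num, den)
            if p not in seen:
                seen.add(p)
                out.append(p)
    return out

def mixture_prob_zero(xs, params):
    """Predictive P(next = 0 | xs) of the mixture sum_k w_k * Bernoulli(p_k).

    Here p_k is the probability of 0, and the prior weights w_k are
    proportional to 1/k^2 (an illustrative choice of summable weights).
    """
    zeros = xs.count(0)
    ones = len(xs) - zeros
    # Unnormalized posterior weight of p_k: prior weight times likelihood.
    post = [Fraction(1, k * k) * p**zeros * (1 - p)**ones
            for k, p in enumerate(params, start=1)]
    return sum(q * p for q, p in zip(post, params)) / sum(post)

xs = [0, 0, 1] * 100           # empirical frequency of 0 is 2/3
params = rational_params(12)   # hypothetical truncation level
print(float(laplace_prob_zero(xs)), float(mixture_prob_zero(xs, params)))
```

On this sample both predicted probabilities should come out close to the empirical frequency of 0; how close the mixture is at a given $n$ depends on the prior weights and on the truncation.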

The results. Here we show that if there is a predictor that performs well for
every measure coming from a class $C$ of processes, then a predictor can also
be obtained as a convex combination $\sum_{k \in \mathbb{N}} w_k \mu_k$ for some
$\mu_k \in C$ and some $w_k > 0$, $k \in \mathbb{N}$. This holds if the
prediction quality is measured by either total variation distance, or expected
average KL divergence: one measure of performance that is very strong, the
other rather weak. The analysis for the total variation case relies on the fact
that if $\rho$ predicts $\mu$ in total variation distance, then $\mu$ is
absolutely continuous with respect to $\rho$, so that
$\rho(x_{1..n})/\mu(x_{1..n})$ converges to a positive number with
$\mu$-probability 1 and with a positive $\rho$-probability. However, if we
settle for a weaker measure of performance, such as expected average KL
divergence, measures $\mu \in C$ are typically singular with respect to a
predictor $\rho$. Nevertheless, since $\rho$ predicts $\mu$ we can show that
$\rho(x_{1..n})/\mu(x_{1..n})$ decreases subexponentially with $n$ (with high
probability), and then we can use this ratio as an analogue of the density for
each time step $n$, and find a convex combination of countably many measures
from $C$ that has the desired predictive properties for each $n$. Combining
these predictors for all $n$ then results in a predictor that predicts every
$\mu \in C$ in average KL divergence. The proof techniques developed have the
potential to be used in solving other questions concerning sequence
prediction, in particular, the general question of how to find a predictor for
an arbitrary class $C$ of measures.

2 Preliminaries 

Let $X$ be a finite set. The notation $x_{1..n}$ is used for $x_1, \dots, x_n$.
We consider stochastic processes (probability measures) on
$(X^\infty, \mathcal{F})$ where $\mathcal{F}$ is the sigma-field generated by
the cylinder sets $[x_{1..n}]$, $x_i \in X$, $n \in \mathbb{N}$, where
$[x_{1..n}]$ is the set of all infinite sequences that start with $x_{1..n}$.
For a finite set $A$ denote $|A|$ its cardinality. We use $E_\mu$ for
expectation with respect to a measure $\mu$.

Next we introduce the measures of the quality of prediction used in this
paper. For two measures $\mu$ and $\rho$ we are interested in how different the
$\mu$- and $\rho$-conditional probabilities are, given a data sample
$x_{1..n}$. Introduce the total variation distance

$$v(\mu, \rho, x_{1..n}) := \sup_{A \in \mathcal{F}} |\mu(A \mid x_{1..n}) - \rho(A \mid x_{1..n})|.$$

Definition 1. We say that $\rho$ predicts $\mu$ in total variation if

$$v(\mu, \rho, x_{1..n}) \to 0 \quad \mu\text{-a.s.}$$

This convergence is rather strong. In particular, it means that
$\rho$-conditional probabilities of arbitrary far-off events converge to
$\mu$-conditional probabilities. Moreover, $\rho$ predicts $\mu$ in total
variation if [Blackwell and Dubins, 1962] and only if [Kalai and Lehrer, 1994]
$\mu$ is absolutely continuous with respect to $\rho$.

Thus, for a class $C$ of measures there is a predictor $\rho$ that predicts
every $\mu \in C$ in total variation if and only if every $\mu \in C$ has a
density with respect to $\rho$. Although such sets of processes are rather
large, they do not include even such basic examples as the set of all Bernoulli
i.i.d. processes. That is, there is no $\rho$ that would predict in total
variation every Bernoulli i.i.d. process measure $\beta_p$, $p \in [0, 1]$,
where $p$ is the probability of 0. Therefore, perhaps for many (if not most)
practical applications this measure of the quality of prediction is too
strong, and one is interested in weaker measures of performance.

For two measures $\mu$ and $\rho$ introduce the expected cumulative
Kullback-Leibler divergence (KL divergence) as

$$d_n(\mu, \rho) := E_\mu \sum_{t=1}^{n} \sum_{a \in X} \mu(x_t = a \mid x_{1..t-1}) \log \frac{\mu(x_t = a \mid x_{1..t-1})}{\rho(x_t = a \mid x_{1..t-1})}. \tag{1}$$

In words, we take the expected (over data) cumulative (over time) KL divergence
between $\mu$- and $\rho$-conditional (on the past data) probability
distributions of the next outcome; dividing by $n$ gives the time average.

Definition 2. We say that $\rho$ predicts $\mu$ in expected average KL
divergence if

$$\frac{1}{n} d_n(\mu, \rho) \to 0.$$

This measure of performance is much weaker, in the sense that it requires
good predictions only one step ahead, and not on every step but only on
average; also the convergence is not with probability 1 but in expectation.
With prediction quality so measured, predictors exist for relatively large
classes of measures; most notably, [Ryabko, 1988] provides a predictor which
predicts every stationary process in expected average KL divergence. A simple
but useful identity that we will need (in the context of sequence prediction
introduced also in [Ryabko, 1988]) is the following

$$d_n(\mu, \rho) = -\sum_{x_{1..n} \in X^n} \mu(x_{1..n}) \log \frac{\rho(x_{1..n})}{\mu(x_{1..n})}, \tag{2}$$

where on the right-hand side we have simply the KL divergence between measures
$\mu$ and $\rho$ restricted to the first $n$ observations.
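The identity between the step-by-step definition of the expected cumulative KL divergence and the KL divergence of the $n$-step marginals can be checked by brute-force enumeration on a small example. The sketch below is only an illustration: the choice of a two-state Markov chain for $\mu$ (with a hypothetical "stay" probability of 0.8) and of the Laplace predictor for $\rho$ is an assumption made here, and logarithms are taken base 2.

```python
import itertools
import math

def mu_cond(a, past):
    """Hypothetical mu: a binary Markov chain that repeats the last symbol w.p. 0.8."""
    if not past:
        return 0.5
    return 0.8 if a == past[-1] else 1 - 0.8

def rho_cond(a, past):
    """Laplace predictor over the binary alphabet, used here in the role of rho."""
    return (past.count(a) + 1) / (len(past) + 2)

def joint(cond, xs):
    """Probability of the word xs as a product of conditional probabilities."""
    p = 1.0
    for t in range(len(xs)):
        p *= cond(xs[t], xs[:t])
    return p

def d_n_stepwise(n):
    """Expected sum over t of the KL divergence between next-symbol conditionals."""
    total = 0.0
    for t in range(n):
        for past in itertools.product((0, 1), repeat=t):
            weight = joint(mu_cond, past)  # mu-probability of the observed past
            for a in (0, 1):
                m = mu_cond(a, past)
                total += weight * m * math.log2(m / rho_cond(a, past))
    return total

def d_n_joint(n):
    """KL divergence between mu and rho restricted to the first n observations."""
    total = 0.0
    for xs in itertools.product((0, 1), repeat=n):
        m = joint(mu_cond, xs)
        total += m * math.log2(m / joint(rho_cond, xs))
    return total

print(d_n_stepwise(6), d_n_joint(6))
```

The two printed values should agree up to floating-point error, which is the chain rule for KL divergence in this finite setting.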

Thus, the results of this work will be established with respect to two very 
different measures of prediction quality, one of which is very strong and the other 
rather weak. This suggests that the facts established reflect some fundamental 
properties of the problem of prediction, rather than those pertinent to particular 
measures of performance. On the other hand, it remains open to extend the 
results below to different measures of performance.

3 Main results



Theorem 1. Let $C$ be a set of probability measures on $X^\infty$. If there is
a measure $\rho$ such that $\rho$ predicts every $\mu \in C$ in total
variation, then there is a sequence $\mu_k \in C$, $k \in \mathbb{N}$ such that
the measure $\nu := \sum_{k \in \mathbb{N}} w_k \mu_k$ predicts every
$\mu \in C$ in total variation, where $w_k$ are any positive weights that sum
to 1.

This relatively simple fact can be proven in different ways, relying on the
equivalence of the statements "$\rho$ predicts $\mu$ in total variation
distance" and "$\mu$ is absolutely continuous with respect to $\rho$." The
proof presented below uses techniques that can then be generalized to the case
of prediction in expected average KL divergence, where in all interesting cases
all measures $\mu \in C$ are singular with respect to any predictor that
predicts all of them. The idea of the proof of Theorem 1 is as follows. For
each measure $\mu \in C$ we find the set $T_\mu$ of sequences
$x_1, x_2, \dots$ on which the density of $\mu$ with respect to $\rho$ exists
and is non-zero. Such a set has $\mu$-probability 1, and, by absolute
continuity, a positive $\rho$-probability. The idea is then to cover the union
$\cup_{\mu \in C} T_\mu$ with countably many of these sets, and then construct
a new predictor as a sum of the corresponding measures. To find this countable
collection of sets $T_\mu$, we first find a largest (up to an
$\varepsilon_1$) one with respect to $\rho$, then the one that has a largest
(up to an $\varepsilon_2$) part not covered by the first set, and so on (where
$\varepsilon_k$ are decreasing). Then we show that any strictly convex
combination of the resulting sequence of measures has the property that any
measure in $C$ is absolutely continuous with respect to it.

Proof. We break the (relatively easy) proof of this theorem into 3 steps, which
will make the (more involved) proof of the next theorem more understandable.

Step 1: densities. For any $\mu \in C$, since $\rho$ predicts $\mu$ in total
variation, $\mu$ has a density (Radon-Nikodym derivative) $f_\mu$ with respect
to $\rho$. Thus, for the set $T_\mu$ of all sequences
$x_1, x_2, \dots \in X^\infty$ on which $f_\mu(x_{1,2,\dots}) > 0$ (the limit
$\lim_{n \to \infty} \rho(x_{1..n})/\mu(x_{1..n})$ exists and is finite and
positive) we have $\mu(T_\mu) = 1$ and $\rho(T_\mu) > 0$. Next we will
construct a sequence of measures $\mu_k \in C$, $k \in \mathbb{N}$ such that
the union of the sets $T_{\mu_k}$ has probability 1 with respect to every
$\mu \in C$, and will show that this is a sequence of measures whose existence
is asserted in the theorem statement.

Step 2: a countable cover and the resulting predictor. Let
$\varepsilon_k := 2^{-k}$ and let $m_1 := \sup_{\mu \in C} \rho(T_\mu)$.
Clearly, $m_1 > 0$. Find any $\mu_1 \in C$ such that
$\rho(T_{\mu_1}) \ge m_1 - \varepsilon_1$, and let $T_1 := T_{\mu_1}$. For
$k > 1$ define $m_k := \sup_{\mu \in C} \rho(T_\mu \setminus T_{k-1})$. If
$m_k = 0$ then define $T_k := T_{k-1}$, otherwise find any $\mu_k$ such that
$\rho(T_{\mu_k} \setminus T_{k-1}) \ge m_k - \varepsilon_k$, and let
$T_k := T_{k-1} \cup T_{\mu_k}$. Define the predictor $\nu$ as
$\nu := \sum_{k \in \mathbb{N}} w_k \mu_k$.

Step 3: $\nu$ predicts every $\mu \in C$. Since the sets
$T_1, T_2 \setminus T_1, \dots, T_k \setminus T_{k-1}, \dots$ are disjoint, we
must have $\rho(T_k \setminus T_{k-1}) \to 0$, so that $m_k \to 0$. Let

$$T := \cup_{k \in \mathbb{N}} T_k.$$

Fix any $\mu \in C$. Suppose that $\mu(T_\mu \setminus T) > 0$. Since $\mu$ is
absolutely continuous with respect to $\rho$, we must have
$\delta := \rho(T_\mu \setminus T) > 0$. Then for every $k > 1$ we have

$$m_k = \sup_{\mu' \in C} \rho(T_{\mu'} \setminus T_{k-1}) \ge \rho(T_\mu \setminus T_{k-1}) \ge \delta > 0,$$

which contradicts $m_k \to 0$. Thus, we have shown that

$$\mu(T \cap T_\mu) = 1. \tag{3}$$

Let us show that every $\mu \in C$ is absolutely continuous with respect to
$\nu$. Indeed, fix any $\mu \in C$ and suppose $\mu(A) > 0$ for some
$A \in \mathcal{F}$. Then from (3) we have $\mu(A \cap T) > 0$, and, by
absolute continuity of $\mu$ with respect to $\rho$, also
$\rho(A \cap T) > 0$. Since $T = \cup_{k \in \mathbb{N}} T_k$ we must have
$\rho(A \cap T_k) > 0$ for some $k \in \mathbb{N}$. Since on the set $T_k$ the
measure $\mu_k$ has non-zero density $f_{\mu_k}$ with respect to $\rho$, we
must have $\mu_k(A \cap T_k) > 0$. (Indeed,
$\mu_k(A \cap T_k) = \int_{A \cap T_k} f_{\mu_k} \, d\rho > 0$.) Hence,

$$\nu(A \cap T_k) \ge w_k \mu_k(A \cap T_k) > 0,$$

so that $\nu(A) > 0$. Thus, $\mu$ is absolutely continuous with respect to
$\nu$, and so $\nu$ predicts $\mu$ in total variation distance. □

Theorem 2. Let $C$ be a set of probability measures on $X^\infty$. If there is
a measure $\rho$ such that $\rho$ predicts every $\mu \in C$ in expected
average KL divergence, then there is a sequence $\mu_k \in C$,
$k \in \mathbb{N}$ such that the measure
$\nu := \sum_{k \in \mathbb{N}} w_k \mu_k$ predicts every $\mu \in C$ in
expected average KL divergence, where $w_k$ are some positive weights.

A difference worth noting with respect to the formulation of Theorem 1
(apart from a different measure of divergence) is that in the latter the
weights $w_k$ can be chosen arbitrarily, while in Theorem 2 they cannot. In
general, the statement "$\sum_{k \in \mathbb{N}} w_k \nu_k$ predicts $\mu$ in
expected average KL divergence for some choice of $w_k$, $k \in \mathbb{N}$"
does not imply "$\sum_{k \in \mathbb{N}} w'_k \nu_k$ predicts $\mu$ in expected
average KL divergence for every summable sequence of positive $w'_k$,
$k \in \mathbb{N}$," while the implication trivially holds true if the
expected average KL divergence is replaced by the total variation. An
interesting related question (which is beyond the scope of this paper) is how
to choose the weights to optimize the behaviour of a predictor before the
asymptotic regime.

The idea of the proof is as follows. For every $\mu$ and every $n$ we consider
the sets $T^n_\mu$ of those $x_{1..n}$ on which $\mu$ is greater than $\rho$.
These sets have to have (from some $n$ on) a high probability with respect to
$\mu$. Then since $\rho$ predicts $\mu$ in expected average KL divergence, the
$\rho$-probability of these sets cannot decrease exponentially fast (that is,
it has to be quite large). (The sequences $\mu(x_{1..n})/\rho(x_{1..n})$,
$n \in \mathbb{N}$ will play the role of densities of the proof of Theorem 1,
and the sets $T^n_\mu$ the role of sets $T_\mu$ on which the density is
non-zero.) We then use, for each given $n$, the same scheme to cover the set
$X^n$ with countably many $T^n_\mu$, as was used in the proof of Theorem 1 to
construct a countable covering of the set $X^\infty$, obtaining for each $n$ a
predictor $\nu_n$. Then the predictor $\nu$ is obtained as
$\sum_{n \in \mathbb{N}} w_n \nu_n$, where the weights decrease
subexponentially. The latter fact ensures that, although the weights depend on
$n$, they still play no role asymptotically. The technically most involved
part of the proof is to show that the sets $T^n_\mu$ in asymptotic have
sufficiently large weights in those countable covers that we construct for
each $n$. This is used to demonstrate the implication "if a set has a high
$\mu$-probability then its $\rho$-probability does not decrease too fast,
provided some regularity conditions." The proof is broken into the same steps
as the (simpler) proof of Theorem 1 to make the analogy explicit and the proof
more understandable.

Proof. Define the weights $w_k := w k^{-2}$, where $w$ is the normalizer $6/\pi^2$.
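As a quick sanity check on these weights, the following Python sketch (illustrative only; the helper name `weight` and the cut-off values are assumptions) confirms numerically that the weights $w k^{-2}$ with $w = 6/\pi^2$ sum to approximately 1, and that the per-step cost $-\frac{1}{n}\log w_n$ of a polynomially decaying weight shrinks with $n$, which is the sense in which such weights play no role asymptotically.

```python
import math

W = 6 / math.pi**2  # normalizer, since sum_{k>=1} 1/k^2 = pi^2/6

def weight(k):
    """Weight w_k = w * k^(-2)."""
    return W / k**2

# Partial sum of the weights; the tail beyond K is of order W/K.
partial_sum = sum(weight(k) for k in range(1, 200001))

# Per-step cost of the weight w_n in a log-loss bound (base-2 logarithm).
penalty_per_step = [-math.log2(weight(n)) / n for n in (10, 100, 1000, 10000)]

print(partial_sum, penalty_per_step)
```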
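Step 1: densities. Define the sets

$$T^n_\mu := \left\{ x_{1..n} \in X^n : \mu(x_{1..n}) \ge \frac{1}{n} \rho(x_{1..n}) \right\}. \tag{4}$$

Using Markov's inequality, we derive

$$\mu(X^n \setminus T^n_\mu) = \mu\left( \frac{\rho(x_{1..n})}{\mu(x_{1..n})} > n \right) \le \frac{1}{n} E_\mu \frac{\rho(x_{1..n})}{\mu(x_{1..n})} = \frac{1}{n}, \tag{5}$$

so that $\mu(T^n_\mu) \to 1$. (Note that if $\mu$ is singular with respect to
$\rho$, as is typically the case, then $\rho(x_{1..n})/\mu(x_{1..n})$ converges
to 0 $\mu$-a.e. and one can replace $\frac{1}{n}$ in (4) by 1, while still
having $\mu(T^n_\mu) \to 1$.)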
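The Markov-inequality bound on the $\mu$-probability of the complement of $T^n_\mu$ can be checked numerically in a simple case. The sketch below is an illustration under assumptions made here, not part of the proof: $\mu$ is taken to be a Bernoulli i.i.d. measure with probability `p` of 0, $\rho$ is taken to be the Laplace predictor, and the function names are hypothetical. Both probabilities depend on a word only through its number of zeros, so words are grouped by that count.

```python
from math import comb, factorial

def mu_word(zeros, n, p):
    """Bernoulli(p) probability of one word of length n with the given number of zeros."""
    return p**zeros * (1 - p)**(n - zeros)

def rho_word(zeros, n):
    """Probability the Laplace predictor assigns to one such word: z!(n-z)!/(n+1)!."""
    return factorial(zeros) * factorial(n - zeros) / factorial(n + 1)

def mass_outside_T(n, p):
    """mu-probability of the words with mu(x) < rho(x)/n (the complement of T^n_mu)."""
    return sum(comb(n, z) * mu_word(z, n, p)
               for z in range(n + 1)
               if mu_word(z, n, p) < rho_word(z, n) / n)

for n in (2, 10, 50, 200):
    print(n, mass_outside_T(n, 0.3), 1 / n)
```

For each `n` the computed mass should not exceed `1/n`; in this example the bound is typically far from tight.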

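Step 2n: a countable cover, time $n$. Fix an $n \in \mathbb{N}$. Define
$m^n_1 := \max_{\mu \in C} \rho(T^n_\mu)$ (since $X^n$ are finite all suprema
are reached). Find any $\mu^n_1$ such that $\rho(T^n_{\mu^n_1}) = m^n_1$ and
let $T^n_1 := T^n_{\mu^n_1}$. For $k > 1$, let
$m^n_k := \max_{\mu \in C} \rho(T^n_\mu \setminus T^n_{k-1})$. If $m^n_k > 0$,
let $\mu^n_k$ be any $\mu \in C$ such that
$\rho(T^n_{\mu^n_k} \setminus T^n_{k-1}) = m^n_k$, and let
$T^n_k := T^n_{k-1} \cup T^n_{\mu^n_k}$; otherwise let $T^n_k := T^n_{k-1}$.
Observe that (for each $n$) there is only a finite number of positive $m^n_k$,
since the set $X^n$ is finite; let $K_n$ be the largest index $k$ such that
$m^n_k > 0$. Let

$$\nu_n := \sum_{k=1}^{K_n} w_k \mu^n_k. \tag{6}$$

As a result of this construction, for every $n \in \mathbb{N}$, every
$k \le K_n$ and every $x_{1..n} \in T^n_{\mu^n_k}$, using (4) we obtain

$$\nu_n(x_{1..n}) \ge w_k \frac{1}{n} \rho(x_{1..n}). \tag{7}$$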
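The greedy cover at a fixed time $n$ can be sketched in code for a toy case. The following Python example is a minimal illustration under assumptions introduced here: the class is a small finite set of Bernoulli i.i.d. measures (the proof allows an arbitrary class), $\rho$ is the Laplace predictor, and all function names are hypothetical. It builds the sets of words on which each measure is at least $\rho/n$, greedily picks measures by the $\rho$-mass they newly cover, forms the weighted sum of the chosen measures, and checks the lower bound by $w_k \rho / n$ on each chosen set.

```python
import itertools
import math

def bernoulli(p):
    """Bernoulli i.i.d. measure on binary words; p is the probability of 0."""
    return lambda xs: math.prod(p if x == 0 else 1 - p for x in xs)

def laplace(xs):
    """Probability the Laplace predictor assigns to the word xs."""
    z, n = xs.count(0), len(xs)
    return math.factorial(z) * math.factorial(n - z) / math.factorial(n + 1)

def weight(k):
    """Weight w_k = (6/pi^2) * k^(-2)."""
    return 6 / math.pi**2 / k**2

def greedy_cover(measures, rho, n):
    """Greedily select measures whose sets T add the most uncovered rho-mass."""
    words = list(itertools.product((0, 1), repeat=n))
    T = [{x for x in words if mu(x) >= rho(x) / n} for mu in measures]
    covered, chosen = set(), []
    while True:
        gains = [sum(rho(x) for x in t - covered) for t in T]
        best = max(range(len(T)), key=gains.__getitem__)
        if gains[best] <= 0:   # nothing new can be covered: stop
            break
        chosen.append(best)
        covered |= T[best]
    return words, T, chosen

def check_bound(measures, rho, n):
    """Check nu_n(x) >= w_k * rho(x) / n on the set of the k-th chosen measure."""
    _, T, chosen = greedy_cover(measures, rho, n)
    def nu_n(x):
        return sum(weight(k) * measures[i](x)
                   for k, i in enumerate(chosen, start=1))
    return all(nu_n(x) >= weight(k) * (rho(x) / n)
               for k, i in enumerate(chosen, start=1) for x in T[i])

measures = [bernoulli(p) for p in (0.1, 0.3, 0.5, 0.7, 0.9)]
words, T, chosen = greedy_cover(measures, laplace, 6)
print(chosen, check_bound(measures, laplace, 6))
```

With a finite class the maximum in each greedy step is trivially attained; in the proof it is the finiteness of $X^n$ that guarantees this for an arbitrary class.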
n 

Step 2: the resulting predictor. Finally, define

$$\nu := \frac{1}{2} \gamma + \frac{1}{2} \sum_{n \in \mathbb{N}} w_n \nu_n, \tag{8}$$

where $\gamma$ is the i.i.d. measure with equal probabilities of all
$x \in X$ (that is, $\gamma(x_{1..n}) = |X|^{-n}$ for every $n \in \mathbb{N}$
and every $x_{1..n} \in X^n$). We will show that $\nu$ predicts every
$\mu \in C$, and then at the end of the proof (Step r) we will show how to
replace $\gamma$ by a combination of a countable set of elements of $C$ (in
fact, $\gamma$ is just a regularizer which ensures that the $\nu$-probability
of any word is never too close to 0).

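Step 3: $\nu$ predicts every $\mu \in C$. Fix any $\mu \in C$. Introduce the
parameters $\varepsilon^n_\mu \in (0, 1)$, $n \in \mathbb{N}$, to be defined
later, and let $j^n_\mu := 1/\varepsilon^n_\mu$. Observe that
$\rho(T^n_k \setminus T^n_{k-1}) \ge \rho(T^n_{k+1} \setminus T^n_k)$, for any
$k > 1$ and any $n \in \mathbb{N}$, by definition of these sets. Since the sets
$T^n_k \setminus T^n_{k-1}$, $k \in \mathbb{N}$ are disjoint, we obtain
$\rho(T^n_k \setminus T^n_{k-1}) \le 1/k$. Hence,
$\rho(T^n_\mu \setminus T^n_j) \le \varepsilon^n_\mu$ for some
$j \le j^n_\mu$, since otherwise
$m^n_{j^n_\mu + 1} = \max_{\mu' \in C} \rho(T^n_{\mu'} \setminus T^n_{j^n_\mu}) > \varepsilon^n_\mu$
so that
$\rho(T^n_{j^n_\mu + 1} \setminus T^n_{j^n_\mu}) > \varepsilon^n_\mu = 1/j^n_\mu$,
which is a contradiction. Thus,

$$\rho(T^n_\mu \setminus T^n_{j^n_\mu}) \le \varepsilon^n_\mu. \tag{9}$$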
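We can upper-bound $\mu(T^n_\mu \setminus T^n_{j^n_\mu})$ as follows. First,
observe that

$$
\begin{aligned}
d_n(\mu, \rho) ={} & -\sum_{x_{1..n} \in T^n_\mu \cap T^n_{j^n_\mu}} \mu(x_{1..n}) \log \frac{\rho(x_{1..n})}{\mu(x_{1..n})} \\
& -\sum_{x_{1..n} \in T^n_\mu \setminus T^n_{j^n_\mu}} \mu(x_{1..n}) \log \frac{\rho(x_{1..n})}{\mu(x_{1..n})} \\
& -\sum_{x_{1..n} \in X^n \setminus T^n_\mu} \mu(x_{1..n}) \log \frac{\rho(x_{1..n})}{\mu(x_{1..n})} \\
={} & I + II + III.
\end{aligned}
\tag{10}
$$

Then, from (4) we get

$$I \ge -\log n. \tag{11}$$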

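Observe that for every $n \in \mathbb{N}$ and every set $A \subset X^n$, using
Jensen's inequality we can obtain

$$
\begin{aligned}
-\sum_{x_{1..n} \in A} \mu(x_{1..n}) \log \frac{\rho(x_{1..n})}{\mu(x_{1..n})}
&= -\mu(A) \sum_{x_{1..n} \in A} \frac{1}{\mu(A)} \mu(x_{1..n}) \log \frac{\rho(x_{1..n})}{\mu(x_{1..n})} \\
&\ge -\mu(A) \log \frac{\rho(A)}{\mu(A)} \ge -\mu(A) \log \rho(A) - \frac{1}{2}.
\end{aligned}
\tag{12}
$$

Thus, from (12) and (9) we get

$$II \ge -\mu(T^n_\mu \setminus T^n_{j^n_\mu}) \log \rho(T^n_\mu \setminus T^n_{j^n_\mu}) - \frac{1}{2} \ge -\mu(T^n_\mu \setminus T^n_{j^n_\mu}) \log \varepsilon^n_\mu - \frac{1}{2}. \tag{13}$$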

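Furthermore,

$$
\begin{aligned}
III &\ge \sum_{x_{1..n} \in X^n \setminus T^n_\mu} \mu(x_{1..n}) \log \mu(x_{1..n})
\ge \mu(X^n \setminus T^n_\mu) \log \frac{\mu(X^n \setminus T^n_\mu)}{|X^n \setminus T^n_\mu|} \\
&\ge -\frac{1}{2} - \mu(X^n \setminus T^n_\mu) \, n \log |X| \ge -\frac{1}{2} - \log |X|,
\end{aligned}
\tag{14}
$$

where in the second inequality we have used the fact that entropy is maximized
when all events are equiprobable, in the third one we used
$|X^n \setminus T^n_\mu| \le |X|^n$, while the last inequality follows from
(5). Combining (10) with the bounds (11), (13) and (14) we obtain

$$d_n(\mu, \rho) \ge -\log n - \mu(T^n_\mu \setminus T^n_{j^n_\mu}) \log \varepsilon^n_\mu - 1 - \log |X|,$$

so that

$$\mu(T^n_\mu \setminus T^n_{j^n_\mu}) \le \frac{1}{-\log \varepsilon^n_\mu} \left( d_n(\mu, \rho) + \log n + 1 + \log |X| \right). \tag{15}$$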

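Since $d_n(\mu, \rho) = o(n)$, we can define the parameters
$\varepsilon^n_\mu$ in such a way that $-\log \varepsilon^n_\mu = o(n)$ while
at the same time the bound (15) gives
$\mu(T^n_\mu \setminus T^n_{j^n_\mu}) = o(1)$. Fix such a choice of
$\varepsilon^n_\mu$. Then, using $\mu(T^n_\mu) \to 1$, we can conclude

$$\mu(X^n \setminus T^n_{j^n_\mu}) \le \mu(X^n \setminus T^n_\mu) + \mu(T^n_\mu \setminus T^n_{j^n_\mu}) = o(1). \tag{16}$$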

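We proceed with the proof of $d_n(\mu, \nu) = o(n)$. For any
$x_{1..n} \in T^n_{j^n_\mu}$ we have

$$\nu(x_{1..n}) \ge \frac{1}{2} w_n \nu_n(x_{1..n}) \ge \frac{1}{2} w_n w_{j^n_\mu} \frac{1}{n} \rho(x_{1..n}) = \frac{w^2}{2 n^3} (\varepsilon^n_\mu)^2 \rho(x_{1..n}), \tag{17}$$

where the first inequality follows from (8), the second from (7), and in the
equality we have used $w_{j^n_\mu} = w/(j^n_\mu)^2$ and
$j^n_\mu = 1/\varepsilon^n_\mu$. Next we use the decomposition

$$d_n(\mu, \nu) = -\sum_{x_{1..n} \in T^n_{j^n_\mu}} \mu(x_{1..n}) \log \frac{\nu(x_{1..n})}{\mu(x_{1..n})} - \sum_{x_{1..n} \in X^n \setminus T^n_{j^n_\mu}} \mu(x_{1..n}) \log \frac{\nu(x_{1..n})}{\mu(x_{1..n})} = I + II. \tag{18}$$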
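From (17) we find

$$
\begin{aligned}
I &\le -\log \left( \frac{w^2}{2 n^3} (\varepsilon^n_\mu)^2 \right) - \sum_{x_{1..n} \in T^n_{j^n_\mu}} \mu(x_{1..n}) \log \frac{\rho(x_{1..n})}{\mu(x_{1..n})} \\
&= (1 + 3 \log n - 2 \log \varepsilon^n_\mu - 2 \log w) + \left( d_n(\mu, \rho) + \sum_{x_{1..n} \in X^n \setminus T^n_{j^n_\mu}} \mu(x_{1..n}) \log \frac{\rho(x_{1..n})}{\mu(x_{1..n})} \right) \\
&\le o(n) - \sum_{x_{1..n} \in X^n \setminus T^n_{j^n_\mu}} \mu(x_{1..n}) \log \mu(x_{1..n}) \\
&\le o(n) + \mu(X^n \setminus T^n_{j^n_\mu}) \, n \log |X| = o(n),
\end{aligned}
\tag{19}
$$

where in the second inequality we have used $-\log \varepsilon^n_\mu = o(n)$
and $d_n(\mu, \rho) = o(n)$, in the last inequality we have again used the
fact that the entropy is maximized when all events are equiprobable, while the
last equality follows from (16).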
Moreover, from (8) we find

$$II \le \log 2 - \sum_{x_{1..n} \in X^n \setminus T^n_{j^n_\mu}} \mu(x_{1..n}) \log \frac{\gamma(x_{1..n})}{\mu(x_{1..n})} \le 1 + n \mu(X^n \setminus T^n_{j^n_\mu}) \log |X| = o(n), \tag{20}$$

where in the last inequality we have used $\gamma(x_{1..n}) = |X|^{-n}$ and
$\mu(x_{1..n}) \le 1$, and the last equality follows from (16).

From (18), (19) and (20) we conclude $\frac{1}{n} d_n(\mu, \nu) \to 0$.

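Step r: the regularizer $\gamma$. It remains to show that the i.i.d.
regularizer $\gamma$ in the definition of $\nu$ (8) can be replaced by a
convex combination of countably many elements from $C$. Indeed, for each
$n \in \mathbb{N}$, denote

$$A_n := \{ x_{1..n} \in X^n : \exists \mu \in C \ \mu(x_{1..n}) \ne 0 \},$$

and let $\mu_{x_{1..n}} := \operatorname{argmax}_{\mu \in C} \mu(x_{1..n})$ for
each $x_{1..n} \in A_n$. Define

$$\gamma'_n(x'_{1..n}) := \frac{1}{|A_n|} \sum_{x_{1..n} \in A_n} \mu_{x_{1..n}}(x'_{1..n})$$

for each $x'_{1..n} \in X^n$, $n \in \mathbb{N}$, and let
$\gamma' := \sum_{k \in \mathbb{N}} w_k \gamma'_k$. For every $\mu \in C$ we
have

$$\gamma'(x_{1..n}) \ge w_n |A_n|^{-1} \mu_{x_{1..n}}(x_{1..n}) \ge w_n |X|^{-n} \mu(x_{1..n})$$

for every $n \in \mathbb{N}$ and every $x_{1..n} \in A_n$, which clearly
suffices to establish the bound $II = o(n)$ as in (20). □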



4 Discussion 

For two measures of quality of prediction that we have considered, namely, total
variation distance and expected average KL divergence, we have shown that if a
predictor for a class $C$ of measures exists, then a predictor can also be
obtained as a Bayesian mixture over a countable subset of $C$. The first
possible extension of these results that comes to mind is to find out whether
the same holds for other measures of performance, such as prediction in KL
divergence without time-averaging, or with probability 1 rather than in
expectation. Maybe the same results can be obtained in more general
formulations, such as $f$-divergences of [Csiszar, 1967].

More generally, the questions we addressed in this work are a part of a larger
problem: given an arbitrary class $C$ of stochastic processes, find the best
predictor for it. One can approach this problem from other sides. For example,
the first question one may wish to address is for which classes of processes a
predictor exists; see [Ryabko, 2008] for some sufficient conditions, such as
separability of the class $C$. Another approach is to identify the conditions
which two measures $\mu$ and $\rho$ have to satisfy in order for $\rho$ to
predict $\mu$. For prediction in total variation such conditions have been
identified [Blackwell and Dubins, 1962; Kalai and Lehrer, 1994] and, in
particular, in the context of the present work, they turn out to be very
useful. [Kalai and Lehrer, 1994] also provides some characterization for the
case of a weaker notion of prediction: difference between conditional
probabilities of the next (several) outcomes (weak merging of opinions). In
[Ryabko and Hutter, 2008b] some sufficient conditions are found for the case of
prediction in expected average KL divergence, and prediction in average KL
divergence with probability 1. Of course, another very natural approach to the
general problem posed above is to try and find predictors (in the form of
algorithms) for some particular classes of processes which are of practical
interest. Towards this end, the contribution of this work is in providing a
specific form that some solution to this question has to have, if a solution
exists: a Bayesian predictor whose prior is concentrated on a countable set.
This is perhaps a rather simple form, which may be useful for constructing
practical algorithms.